pre-allocate storage for metadata json files, see containers/podman#13967 #1480
Keeping a temporary file of at least the same size as the target file for atomic writes helps reduce the probability of running out of space when deleting entities from the corresponding metadata in a disk-full scenario. This strategy is applied to writing containers.json, layers.json, images.json, and mountpoints.json.

Signed-off-by: Denys Knertser <[email protected]>
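For illustration, a minimal sketch of the idea on Linux via `fallocate(2)`; the function name and spare-file naming are invented for this example and are not the PR's actual code:

```go
package main

import (
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// reserveSpare keeps a pre-allocated file next to target, sized to match
// the just-written contents, so the next atomic write has space reserved.
// All names here are illustrative only.
func reserveSpare(target string, size int64) error {
	spare := filepath.Join(filepath.Dir(target), "."+filepath.Base(target)+".spare")
	f, err := os.OpenFile(spare, os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()
	// fallocate actually reserves blocks on disk; a plain Truncate would
	// only create a sparse file and reserve nothing.
	return unix.Fallocate(int(f.Fd()), 0, 0, size)
}
```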
My initial impression is that this is way too much complexity to handle a desperate situation that could be much more reliably handled by something like deleting a file in /var/log. There may well be several other places that require extra space to free large objects, and without a specific feature test to ensure that this keeps working, we are quite likely to add more of them without ever noticing.

I guess all of this code is more likely to break something in unexpected / untested situations than to help in the situations it is intended for. E.g. the non-Linux path is almost certainly not going to be sufficiently tested, and the behavior difference is likely to be missed by future readers. Admittedly that is a very vague and completely unquantified feeling.

So this is not quite a “no” … but it is almost one.
random := "" | ||
// need predictable name when pre-allocated | ||
if !opts.PreAllocate { | ||
random = strconv.FormatUint(rand.Uint64(), 36) |
This is not a sufficient replacement for `CreateTemp`. I’m not immediately sure if this is called from contexts where that matters.
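For context, a rough sketch of the property `os.CreateTemp` provides and a predictable name does not (an assumption about what the reviewer is alluding to):

```go
package main

import (
	"os"
	"path/filepath"
)

func tempFileContrast(dir string) error {
	// os.CreateTemp retries with fresh random names and opens with
	// O_CREATE|O_EXCL, so an existing (possibly attacker-created) file
	// is never reused:
	f, err := os.CreateTemp(dir, ".tmp-layers.json-")
	if err != nil {
		return err
	}
	f.Close()

	// With a fixed, predictable name there is nothing to retry; the best
	// one can do is fail outright if the name is already taken:
	g, err := os.OpenFile(filepath.Join(dir, ".tmp-layers.json"),
		os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o600)
	if err != nil {
		return err
	}
	return g.Close()
}
```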
```diff
@@ -17,6 +19,9 @@ type AtomicFileWriterOptions struct {
 	// On successful return from Close() this is set to the mtime of the
 	// newly written file.
 	ModTime time.Time
+	// Whenever an atomic file is successfully written create a temporary
+	// file of the same size in order to pre-allocate storage for the next write
+	PreAllocate bool
```
The `PreAllocate` and `!PreAllocate` code paths differ so much, and very importantly they have different security implications (`PreAllocate` must not be used in directories with hostile writers, like `*/tmp`), that I don’t think it makes much sense for them to use the same object at all.

The comparatively small parts of the code that can be shared between the two can be shared another way (e.g. using a shared helper function).
> `PreAllocate` must not be used in directories with hostile writers

Not just that, `PreAllocate` is also only correct if there is some external-to-`pkg/ioutils` locking that ensures serialization of uses of `AtomicFileWriter`. (That’s fine for the container/image/layer stores, we have locks there. But it’s a new responsibility for callers of this code, and it needs to be very clearly documented.)
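A hypothetical illustration of that caller-side responsibility (the store type and lock are stand-ins, and `AtomicWriteFileWithOpts` is assumed from the options struct shown above):

```go
package main

import (
	"sync"

	"github.com/containers/storage/pkg/ioutils"
)

// containerStore is a stand-in for the real stores, which already hold a
// lock around metadata writes. With PreAllocate the temp-file name is
// fixed, so two concurrent writers would collide without this lock.
type containerStore struct {
	mu   sync.Mutex
	path string
}

func (s *containerStore) save(data []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	return ioutils.AtomicWriteFileWithOpts(s.path, data, 0o600,
		&ioutils.AtomicFileWriterOptions{PreAllocate: true})
}
```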
I can totally see your point regarding the complexity. However, I do think that it is a useful feature to be able to remove specific containers/images once the disk is full, rather than the discussed alternative of performing a system reset that drops all containers with their state, volumes, etc. It might not matter that much in a desktop environment (and could be fixed by freeing some less important files, as suggested, because the containers storage would often be located on a root partition), but in a production environment it is desirable to be able to limit the number of affected services.

From what I have been able to gather so far, Podman needs additional space on […]. Moreover, an atomic write of the *.json files as it is implemented now requires roughly as much free space as the size of the original file, which can easily go up to 100s of kBs even with a handful of containers. This means that the space requirements for such an operation are much larger than for a BoltDB update, so the likelihood of encountering an out-of-disk situation when the disk is low is much higher on metadata *.json atomic writes than on BoltDB updates. By keeping a temporary file of the same size around, we can guarantee that an atomic write of a smaller file (which is the case when entries are removed) will succeed.

So, while I agree that one would try to avoid the complexity, I think it might be justified if it allows solving the actual problem. Running out of disk space might not be a very frequently encountered scenario, but when it happens it can be very frustrating.
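To make that guarantee concrete, a sketch of how a kept spare could be reused on the next write, assuming the writer overwrites the pre-allocated file in place and renames it over the target (again, hypothetical names, not the PR's code):

```go
package main

import "os"

// writeReusingSpare overwrites the pre-allocated spare in place; since its
// blocks are already reserved, writing new content that is no larger than
// the reservation cannot fail with ENOSPC, and the final rename replaces
// the target atomically on POSIX filesystems.
func writeReusingSpare(target, spare string, data []byte) error {
	f, err := os.OpenFile(spare, os.O_WRONLY, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.WriteAt(data, 0); err != nil {
		return err
	}
	if err := f.Truncate(int64(len(data))); err != nil { // trim the excess
		return err
	}
	if err := f.Sync(); err != nil {
		return err
	}
	return os.Rename(spare, target)
}
```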
I agree with @mtrmac. There are so many other places where it could go wrong that I am not sure it is worth adding all this complexity, which is difficult to test, just for one instance.
Or maybe … if we are serious about this, there really needs to be an end-to-end Podman test that ensures the actually desired functionality (removing a container when completely out of disk space) keeps working. (I realize that this is a bit circular — we need this PR, and maybe other changes, before that test can be added. So, that means either doing all that work and then discussing whether it is worth maintaining, or discussing that now, with some hypothetical guess at how much work that is.)
This change seems to fix the scenario described in containers/podman#17198.

I would appreciate advice on writing a test case, though. I would imagine that a `delete_container` test case would be a good starting point; however, I lack knowledge of the test environment and don’t know whether it is feasible to fill the filesystem, or to initialize and mount a new filesystem.
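For what it's worth, one possible shape for such a test (everything here is hypothetical: the tmpfs size, the skip behavior, and the elided store setup; mounting requires root):

```go
package storage_test

import (
	"testing"

	"golang.org/x/sys/unix"
)

// Sketch of an end-to-end test: mount a tiny tmpfs, point a store at it,
// fill it until writes fail with ENOSPC, then check deletion still works.
func TestDeleteContainerOnFullDisk(t *testing.T) {
	dir := t.TempDir()
	// A 1 MiB tmpfs makes hitting ENOSPC cheap and deterministic.
	if err := unix.Mount("tmpfs", dir, "tmpfs", 0, "size=1m"); err != nil {
		t.Skipf("cannot mount tmpfs (requires root): %v", err)
	}
	defer unix.Unmount(dir, 0)

	// ... initialize a store rooted at dir, add containers until a write
	// fails with ENOSPC, then assert that removing a container succeeds ...
}
```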